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Abstract 


We show that kernel-based quadrature rules for computing integrals can be seen as a special case of 
random feature expansions for positive definite kernels, for a particular decomposition that always 
exists for such kernels. We provide a theoretical analysis of the number of required samples for 
a given approximation error, leading to both upper and lower bounds that are based solely on the 
eigenvalues of the associated integral operator and match up to logarithmic terms. In particular, we 
show that the upper bound may be obtained from independent and identically distributed samples 
from a specific non-uniform distribution, while the lower bound if valid for any set of points. 
Applying our results to kernel-based quadrature, while our results are fairly general, we recover 
known upper and lower bounds for the special cases of Sobolev spaces. Moreover, our results 
extend to the more general problem of full function approximations (beyond simply computing an 
integral), with results in L2- and Loo-norm that match known results for special cases. Applying 
our results to random features, we show an improvement of the number of random features needed 
to preserve the generalization guarantees for learning with Lipshitz-continuous losses. 


1. Introduction 

The numerical computation of high-dimensional integrals is one of the core computational tasks 
in many areas of machine learning, signal processing and more generally applied mathematics, 
in particular in the context of Bayesian inference (Gelman, 2004), or the study of complex sys¬ 
tems (Robert and Casella, 2005). In this paper, we focus on quadrature rules, that aim at ap¬ 
proximating the integral of a certain function from only the (potentially noisy) knowledge of the 
function values at as few as possible well-chosen points. Key situations that remain active areas 
of research are problems where the measurable space where the function is defined on is either 
high-dimensional or structured (e.g., presence of discrete structures, or graphs). For these prob¬ 
lems, techniques based on positive definite kernels have emerged as having the potential to effi¬ 
ciently deal with these situations, and to improve over plain Monte-Carlo integration (O’Hagan, 
1991; Rasmussen and Ghahr'amani, 2003; Huszar and Duvenaud, 2012; Oates and Girolami, 2015). 
In particular, the quadrature problem may be cast as the one of approximating a fixed elemenf, fhe 
mean elemenf (Smolaef ah, 2007), of a Hilberf space as a linear combination of well chosen el- 
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ements, the goal being to minimize the number of these factors as it corresponds to the required 
number of function evaluations. 

A seemingly unrelated problem on positive definite kernels have recently emerged, namely the 
representation of the corresponding infinite-dimensional feature space from random sets of features. 
If a certain positive definite kernel between two points may be represented as the expectation of 
the product of two random one-dimensional (typically non-linear) features computed on these two 
points, the full kernel (and hence its feature space) may be approximated by sufficiently many 
random samples, replacing the expectation by a sample average (Neal, 1995; Rahimi and Recht, 
2007; Huang et ah, 2006). When using these random features, the complexity of a regular kernel 
method such as the support vector machine or ridge regression goes from quadratic in the number 
of observations to linear in the number of observations, with a constant proportional to the number 
of random features, which thus drives the running time complexity of these methods. 

In this paper, we make the following contributions: 

- After describing the functional analysis framework our analysis is based on and presenting 
many examples in Section 2, we show in Section 3 that these two problems are strongly related; 
more precisely, optimizing weights in kernel-based quadrature rules can be seen as decompos¬ 
ing a certain function in a special class of random features for a particular decomposition that 
always exists for all positive definite kernels on a measurable space. 

- We provide in Section 4 a theoretical analysis of the number of required samples for a given 
approximation error, leading to both upper and lower bounds that are based solely on the eigen¬ 
values of the associated integral operator and match up to logarithmic terms. In particular, we 
show that the upper bound may be obtained as independent and identically distributed samples 
from a specific non-uniform distribution, while the lower bound if valid for any set of points. 

- Applying our results to kernel quadrature, while our results are fairly general, we recover known 
upper and lower bounds for the special cases of Sobolev spaces (Section 4.4). Moreover, our 
results extend to the more general problem of full function approximations (beyond simply 
computing an integral), with results in L 2 - and Loo-norm that match known results for special 
cases (Section 5). 

- Applying our results to random feature expansions, we show in Section 4.5 an improvement of 
the number of random features needed for preserving the generalization guarantees for learning 
with Lipshitz-continuous losses. 


2. Random Feature Expansions of Positive Definite Kernels 

Throughout this paper, we consider a topological space X equipped with a Borel probability mea¬ 
sure dp, which we assume to have full support. This naturally defines the space of square-integrable 
functions^. 


1. For simplicity and following most of the literature on kernel methods, we identify functions and their equivalence 
classes for the equivalence relationship of being equal except for a zero-measure (for dp) subset of X. 
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2.1 Reproducing kernel Hilbert spaces and integral operators 

We consider a continuous positive definite kernel A: : X x X —)• M, that is a symmetric function 
such that for all finite families of points in X, the matrix of pairwise kernel evaluations is positive 
semi-definite. This thus defines a reproducing kernel Hilbert space (RKHS) 3" of functions from X 
fo M, which we also assume separable. This RKHS has fwo importanf characferisfic properties (see, 
e.g., Berlinef and Thomas-Agnan, 2004): 

(a) Membership of kernel evaluations', for any x G X, fhe function k{-,x) : y i-)- k{y,x) is an 
elemenf of T. 

(b) Reproducing property, for all / G T and x G X, f{x) = {f,k{-,x))y. In ofher words, 
function evaluations are equal fo dof-producfs wifh a specific elemenf of T. 

Moreover, fhroughouf fhe paper, we assume fhaf fhe function x k{x, x) is infegrable wifh respecf 
fo dp (which is weaker fhan sup^-gj k{x, x) < oo). This implies fhaf T is a subsef of L 2 {dp)', fhaf 
is, funcfions in fhe RKHS T are all square-infegrable for dp. In general, 3" is sfricfly included in 
L 2 {dp), buf, in fhis paper, we will always assume fhaf if is dense in L 2 {dp), fhaf is, any function 
in L 2 {dp) may be approximafed arbifrarily closely by a function in 3“. Finally, for simplicify of 
our nofafion (fo make sure fhaf fhe sequence of eigenvalues of infegral operafors is infinife) we will 
always assume fhaf L 2 {dp) is infinife-dimensional, which excludes finife sefs for X. Nofe fhaf fhe 
lasf fwo assumptions (denseness and infinife dimensionalify) can easily be relaxed. 

Integral operator. Reproducing kernel Hilbert spaces are often studied thr'ough a specific integral 
operator which leads to an isometry with L 2 {dp) (Smale and Cucker, 2001). Let S : L 2 {dp) 
L 2 {dp) be defined as 

(S/)(x) = [ f{y)k{x,y)dp{y). 

Jx 

Since k{x, x)dp{x) is finite, S is self-adjoint, positive semi-definite and trace-class (Simon, 
1979). Given that S/ is a linear combination of kernel functions k{-,y), it belongs to 3". More 
precisely, since we have assumed that 3"is dense in L 2 {dp), which is the unique positive self- 
adjoint square root of S, is a bijection from L 2 {dp) to our RKHS 3"; that is, for any f £ 3^, there 
exists a unique g G L 2 {dp) such that / = and ||/||m = IbllLjfdp) (Smale and Cucker, 2001). 
This justifies the notation f for / G 3“ and means that is an isometry from L 2 {dp) to 3"; 
in other words, for any functions / and g in 3", we have: 

This justifies the view of 3" as the subspace of functions / G L 2 {dp) such that 

This relationship is even more transparent when considering a spectral decomposition of S. 

Mercer decomposition. From extensions of Mercer’s theorem (Kdnig, 1986), there exists an or¬ 
thonormal basis (em)m^i of L 2 {dp) and a summable non-increasing sequence of strictly positive 
eigenvalues {pm)m^i such that Scm = Pm^m- Note that since we have assumed that 3“ is dense in 
L 2 {dp), there are no zero eigenvalues. 
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Since is an isometry from L 2 {dp) to 3", (/rm^em)m 3 si is an orthonormal basis of S'. Moreover, 
we can use the eigendecomposition to characterize elements of 3" as the functions in L 2 {dp) such 
that 

is finite. In other words, once projected in the orthonormal basis {em)m^i, elements / of 3“ corre¬ 
spond to a certain decay of its decomposition coefficients ((/, e^)L 2 {dp))m'^i- 

Finally, by decomposing the function k{-,y) : x i-)- k{x, y), we obtain the Mercer decomposition: 

k{x,y) = ^ Pmeni{x)em{y)- 
m^l 


Properties of the spectrum. The sequence of eigenvalues {pm)m^i is an important quantity that 
appears in the analysis of kernel methods (Hastie and Tibshirani, 1990; Caponnetto and De Vito, 
2007; Harchaoui et ah, 2008; Bach, 2013; El Alaoui and Mahoney, 2014). It depends both on the 
kernel k and the chosen distribution dp. 

Some modifications of the kernel k or the distribution dp lead to simple behaviors for the spectrum. 
For example, if we have a second distribution so that ^ is upper-bounded by a constant c, then, as 
a consequence of the Courant-Fischer minimax theorem (Horn and Johnson, 2012), the eigenvalues 
for dp' are less than than c times that the ones for dp. Similarly, if the kernel k' is such that ck — k' 
is a positive definite kernel, then we have a similar bound between eigenvalues. 

In this paper, for any strictly positive A, we will also consider the quantity m*{X) equal to the 
number of eigenvalues pm that are greater than or equal to A. Since we have assume that the 
sequence m is non-increasing, we have m*{X) = max{m ^ 1, pm ^ A}. This is a left-continuous 
non-increasing function, that tends to -|-oo when A tends to zero (since we have assumed that there 
are infinitely many strictly positive eigenvalues), and characterizes the sequence {pm)m^i, as we 
can recover pm as pm = sup{A ^ 0, m*{X) ^ m}. 

Potential confusion with covariance operator. Note that the operator S is a self-adjoint operator 
on L 2 {dp). It should not be confused with the (non-centered) covariance operator C (Baker, 1973), 
which is a self-adjoint operator on a different space, namely the RKHS 3“, defined by {g, C f)y = 
fx f(x)g(x)dp(x). Given fhaf is an isomefry from L 2 {dp) fo 3", the operator C may also be 
used to define an operafor on L 2 {dp), which happens fo be exacfly S. Thus, fhe fwo operafors have 
fhe same eigenvalues. Moreover, we have, for any y £ X: 

iCf){y) = {k{-,y),Cf)x = [ k{x,y)f{x)dp{x) = (S/)(y), 

Jx 

fhaf is, C is equal fo fhe resfricfion of S on 3". 

2.2 Kernels as expectations 

On top of the generic assumptions made above, we assume that there is another measurable set V 
equipped with a probability measure dr. We consider a function : V x X —M which is square- 
integrable (for the measure dT®dp), and assume that the kernel k may be written as, for all x,y £ X: 
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k{x,y)= / ^{v,x)ip{v,y)dT{v) = {^p{■,x),^p{■,y))L^^^dT)■ 

In other words, the kernel between x and y is simply the expectation of ip{v, x)ip{v, y) for v fol¬ 
lowing the probability distribution dr. In this paper, we see x ^p{v, x) € M as a one-dimensional 
random feature and (^(u, x)ip{v, y) is the dot-product associated with this random feature. We could 
consider extensions where ^p{v, x) has values in a Hilbert space (and not simply M), but this is out¬ 
side the scope of this paper. 

Square-root of integral operator. Such additional structure allows to give an explicit character¬ 
ization of the RKHS Ik in terms of the features y?. In terms of operators, the function y? leads to 
a specific square-root of the integral operator S defined in Section 2.1 (which is nof fhe posifive 
self-adjoinl square-roof 

We consider fhe bounded linear operator T : L 2 {dT) —)• L 2 {dp) defined as 

{Tg){x) = / g{v)ip{v,x)dT{v) = {g,^i-,x))L^^ar)- (2) 

Jv 

Given T : L 2 (dr) —>• L 2 {dp), fhe adjoinf operator T* : L 2 {dp) —)■ L 2 {dT) is fhe unique operator 
such thaf {g, T*/) 2 , 2 (rfr) = 9i f)L 2 {dp) for fi 9- Given the definition of T in Eq. (2), we simply 
inverse the role of V and X and have: 

{T*f){v)= [ f{x)(p{v,x)dp{x). 

Jx 

This implies by Fubini’s theorem that 

{TT*f){y) = j^(^j^f{x)(f{v,y)dp{x)^(p{v,x)dT{v) 

= [ fix)( [ ^piv,y)^p{v,x)dT{v))dp{x) = [ f{x)k{x,y)dp{x) = {T.f){y), 

Jx \Jv J Jx 

that is we have an expression of the integral operator S as S = TT*. Thus, the decomposition of the 
kernel k as an expectation corresponds to a particular square root T of the integral operator—there 
are many possible choices for such square roots, and thus many possible expansions like Eq. (1). 
It turns out that the positive self-adjoint square root will correspond to the equivalence with 
quadrature rules (see Section 3.2). 

Decomposition of functions in 3“. Since S = TT* and is an isometry between L 2 {dp) 
and T, we can naturally expressed any elements of 3“ through the operator T and thus the features ip. 

As a linear operator, T defines a bijecfion from the orthogonal of its null space (Ker T)-*- C L 2 (dr) 
to its image Im(T) C L 2 {dp), and this allows to define uniquely T~^ f G (Ker T)-*- for any 
/ G Im(r), and a dof-producf on Im(T) as 


(f,k)lm(T) = (T-^f,T-^9}L2(dr)- 
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As shown by Bach (2014, App. A), Im(T) turns out to be equal to our RKHS^. Thus, the norm 
ll/llgr for / G T is equal to the squared L2-norm of T~^f G (Ker T)-*-, which is itself equal to the 
minimum of ||5||^2(dr) 9 '^9 ~ f- resulting g may also be defined through 

pseudo-inverses. 

In other words, a function / G L 2 {dp) is in 3“ if and only if it may be written as 

Vx G X, f{x) = / g{v)ip{v,x)dT{v) = {g,ip{-,x))L 2 {dr), 

Jv 

for a certain function : V —)• M such that 1151112(dr) finite, with a norm ||/||j equal to the 
minimum (which is always attained) of ||5|||2((iT)’ possible decompositions of /. 

Singular value decomposition. The operator T is an Hilbert-Schmidt operator, to which the sin¬ 
gular value decopomposition can be applied (Kato, 1995). That is, there exists an orthonormal basis 
{fm)m^i of (Ker T)-*- C L2(dr), together with the orthonormal basis (em)m^i of L2{dp) which 
we have from the eigenvalue decomposition of S = TT*, such that Tfm = Pm &m- Moreover, we 
have: 

p{v,x) = ^ p]l^em{x)fm{v), (3) 

with a convergence in L2(dr (8) dp). This extends the Mercer decomposition of the kernel k{x,y). 


Integral operator as an expectation. Given the expansion of the kernel k in Eq. (1), we may 
express the integral operator S as follows, explicitly as an expectation: 

s/ = / f{y)k{-,y)dpiy) = [ [ f{y)ip{v,-)ip{v,y)dp{y)dT{v) 

Jx Jx Jv 

= ‘P(x, ■)(‘p(v, ■), f}L2(dp)dT(v} = ( ^ •) <^L 2 (dp) ‘pIx, ■}dT(v)^ /, (4) 

where a ®L2{dp) b is the operator L2{dp) L2{dp) so that (a ^L2idp) b)f = {b, f)L2{dp)a- This 
will be useful to define empirical versions, where the integral over dr will be replaced by a finite 
average. 


2.3 Examples 

In this section, we provide examples of kernels and usual decompositions. We first start by decom¬ 
positions that always exist, then focus on specific kernels based on Fourier components. 


Mercer decompositions. 

follows: 

k{x,y) 


The Mercer decomposition provides an expansion for all kernels, as 




tr S 


(tr S)^/2em(x) 




2. The proof goes as follows: (a) for any y £ X, k{-, y) can be expressed as <p(v, y)(fi(v, ■)dT{v) = Tip{-, y) and 
thus belongs to Im(T); (b) for any / G ImfT), and t/ G X, we have (/, fc(-, ?/))im(T) = (T" V. Pk, y))LpdT) = 
{TT~^ f){y) = f{y), that is, the reproducing property is satisfied. These two properties are characteristic of T. 
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which can be transformed in to an expectation with V = N*. In Section 3.2, we provide another 
generic decomposition with V = X. Note that this decomposition is typically impossible to compute 
(except for special cases below, i.e., special pairs of kernels k and distributions dp). 

Periodic kernels on [0,1]. We consider X = [0,1] and translation-invariant kernels k{x,y) of 
the form k{x, y) = t{x — y), where t is a square-integrable 1-periodic function. These kernels are 
positive definite if and only if the Fourier series of t is non-negative (Wahba, 1990). An orthonormal 
basis of T2([0, 1]) is composed of the constant function cq : x i-)- 1 and the functions Cm '■ x ^ 
v/2 cos 27rmx and Sm '■ x ^ y/2 sin Ittuix. A kernel may thus be expressed as 

k{x, y) = lyoco(x) -h ^ [cm(x)Cm(y) + Sm(x)Sm(y)] = t'O + 2 ^ 1 /^ COS 27rm(x - y). 

m>0 m>0 

This can be put trivially as an expectation with V = Z and leads to the usual Fourier features 
(Rahimi and Recht, 2007). This is also exactly a Mercer decomposition for k and the uniform 
distribution on [0,1], with eigenvalues uo and Um, m > 0 (each of these with multiplicity 2). The 
associated RKHS norm for a function / is then equal to 

[ fi^)dx') -|-2 2 /“^ l" f f (x) cos 27 Tmxdx\ f f{x) sm 27 rmxdx'\ 

Wo ^ ^>0 LWo ^ Wo ^ 1 

A particularly interesting example is obtained through derivatives of /. If / is differentiable and 
has a derivative /' G T2([0,1]), then, on the Fourier series coefficients of /, taking the derivative 
corresponds to multiplying the two m-th coefficients by 27rm and swapping them. Sobolev spaces 
for periodic functions on [0,1] (i.e., such that /(O) = /(I)) are defined through integrability of 
derivatives (Adams and Fournier, 2003). In the Hilbert space set-up, a function / belongs to the 
Sobolev space of order s if one can define a s-th order square-integrable derivative in L 2 (for the 
Lebesgue measure, which happens to be equal to dp), that is, G T2([0,1]). The Sobolev squared 
norm is then defined as any positive linear combination of the quadratic forms f^^\x)‘^dx, t G 
{0,..., s}, with non-zero coefficients for f = 0 and t = s (all of these norms are then equivalent). 
If only using t = 0 and t = s with non-zero coefficients, we need = 1 and = 1 -|- 
An equivalent (i.e, with upper and lower bounded ratios) sequence is obtained by replacing Um = 
(1 -I- by Um = leading to a closed-form formula: 

kix, y) = l+ ^ \ \ > B2si{x - y}), 

(2s)! 

where {x—y} denotes the fractional part of x—y, and B2S is the 2s-th Bernoulli polynomial (Wahba, 
1990). The RKHS 3“ is then the Sobolev space of order s on [0,1], with a norm equivalent to any of 
the family of Sobolev norms; it will be used as a running example throughout this paper. 

Extensions to [0, l]"^. In order to extend to d > 1, we may consider several extensions as described 
by Oates and Girolami (2015), and compute the resulting eigenvalues of the integral operators. For 
simplicity, we consider the Sobolev space on [0,1], with uq = I and 1 /“^ = for m > 0. The first 
possibility to extend to [0,1]*^ is to take a kernel which is simply the pointwise product of individual 
kernels on [0,1]. That is, if k{x,y) is the kernel on [0,1], define K{X,Y) = 0^=1 f/j) 
between X and Y in [0,1]'^. As shown in Appendix A, this leads to eigenvalue decays bounded by 
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m and thus up to logarithmic terms at the same speed m as d = 1. While 
this sounds attractive in terms of generalization performance, it corresponds to a space a function 
which is not a Sobolev space in d dimensions. That is the associated squared norm on / would be 
equivalent to a linear combination of squared L 2 -norm of partial derivatives 

^ 

for all ti,..., in {0,..., s}. This corresponds to functions which have square-integrable partial 
derivatives with all individual orders less than s. All values of s ^ 1 are allowed and lead to an 
RKHS. 

This is thus to be contrasted with the usual multi-dimensional Sobolev space which is composed 
of functions which have square-integrable partial derivatives with all orders (fi,..., td) with sum 
ti + ■ ■ ■ + td less than s. Only s > d/2 is then allowed to get an RKHS. The Sobolev norm is then 
of the form 

In the expansion on the d-th order tensor product of the Fourier basis, the norm above is equivalent 
to putting a weight on the element (mi,..., nid) asymptotically equivalent to ( J2j=i which 

thus represent the inverse of the eigenvalues of the corresponding kernel for the uniform distribution 
dp (this is simply an explicit Mercer decomposition). Thus, the number of eigenvalues which are 
greater than A grows as the number of (mi,..., m^) such that their sum is less than which 

itself is less than a constant times (see a proof in Appendix A). This leads to an eigenvalue 

decay of which is much worse because of the term in 1/d in the exponent. 

Translation invariant kernels on W^. We consider X = and translation-invariant kernels 
k{x,y) of the form k{x,y) = t{x — y), where t is an integrable function from to M. It is 
known that these kernels are positive definite if and only if the Fourier transform of t is always a 
non-negative real number. More precisely, if t{io) = t{x)e~^^ ^dx G M+, then 

k{x, y) = , [ i{uj)e^ = , [ /(w) rcosa;"''xcosa;"''y-|-sincc"''xsina;'''ylda;. 

(2vr) jRd (27r) J-^d 

Following Rahimi and Recht (2007), by sampling w from a density proportional to f(a;) G M+ and 
b uniformly in [0,1] (and independently of tc), then by defining V = x [0,1] and (/?(a;, b, x) = 
s/2cos{ui^X + 27rb), we obfain fhe kernel k. 

For fhese kernels, fhe decay of eigenvalues has been well-sfudied by Widom (1963), who relafes 
fhe decay of eigenvalues fo fhe fails of fhe disfribufion dp and fhe decay of fhe Fourier Iransform 
of t. For example, for fhe Gaussian kernel where k{x,y) = exp(—a||a: — ylH), on sub-Gaussian 
disfribufions, fhe decay of eigenvalues is geomefric, and for kernels leading fo Sobolev spaces of 
order s, such as fhe Mafern kernel (Furrer and Nychka, 2007), fhe decay is of fhe form See 

also examples by Birman and Solomyak (1977); Harchaoui el al. (2008). 

Finally, nofe fhaf in ferms of compulation, Ihere are exlensions fo avoid linear complexify in d (Le el ah, 
2013). 






Kernels on hyperspheres. If X C is the d-dimensional hypersphere {x G ||x||2 = 

1}, then specific kernels may be used, of the form k{x, y) = t{x^y), where t has to have a posi¬ 
tive Legendre expansion (Smola et ah, 2001). Alternatively, kernels based on neural networks with 
random weights are directly in the form of random features (Cho and Saul, 2009; Bach, 2014): 
for example, the kernel k{x,y) = E(u^x)^(u^x)^ for v uniformly distributed in the hyper¬ 
sphere corresponds to sampling weights in a one-hidden layer neural network with rectified linear 
unifs (Cho and Saul, 2009). If furns ouf fhaf fhese kernels have a known decay for fheir specfrum. 

As shown by Smola ef al. (2001); Bach (2014), fhe equivalenf of Fourier series (which corresponds 
fo d = 1) is fhen fhe basis of spherical harmonics, which is organized by infeger frequencies 
k ^ 1; insfead of having 2 basis vecfors (sine and cosine) per frequency, fhere are 0{k'^~^) of 
fhem. As shown by Bach (2014, page 44), we have an explicif expansion of k[x, y) in terms of 
spherical harmonics, leading to a sequence of eigenvalues equal to on the entire subspace 

associated with frequency k. Thus, by taking multiplicity into account, after ~ 

(up to constants) eigenvalues, we have an eigenvalue of this leads to an eigenvalue 

decay (where all eigenvalues are ordered in decreasing order and we consider the m-th one) as 


2.4 Approximation from randomly sampled features 

Given the formulation of k as an expectation in Eq. (1), it is natural to consider sampling n elements 
ui,..., G V from the distribution dr and define fhe kernel approximation 


Hx,y) = - V(/?(ui,x)</?(?;i,y), 

n ^ 


2 = 1 


(5) 


which defines a finife-dimensional RKHS T. 


From fhe sfrong law of large numbers—which can be applied because we have fhe finife expecfafion 
E|(/3 (u, x)ip{v, y)\ ^ (E|(/?(t!,x)pE|(/?(r;, y)\‘^)^^^, when n lends lo infinily, k{x, y) lends lo k{x, y) 
almosf surely, and fhus we gel as fighf as desired approximations of fhe kernel k, for a given pair 
{x,y) G X X X. Rahimi and Rechl (2007) show lhal for Iranslalion-invaiianl kernels on a Euclidean 
space, fhen fhe convergence is uniform over a compacf subsel of X, wilh fhe Iradilional rale of 


convergence of 


log n 


In Ibis paper, we ralher consider approximations of functions in 3“ by funcfions in 3", fhe RKHS 
associated wilh k. A key difficulfy is fhaf in general T is nof even included in 3", and fherefore, 
we cannof use fhe norm in 3" lo characterize approximalions. In Ihis paper, we choose fhe L 2 -norm 
associafed wilh fhe probabilily measure dp on X lo characferize fhe approximation. Given / G 3“ 
wilh norm H/Hj- less lhan one, we look for a function / G T of fhe smallesf possible norm and so 
fhaf 11/ — /lUjfrfp) small as possible. 

Note lhal Ihe measure dr is associated lo fhe kernel k and Ihe random fealures p, while Ihe mea¬ 
sure dp is associated lo Ihe way we wanl lo measure errors (and leads lo a specific definlion of Ihe 
integral operator S). 
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Computation of error. Given the definition of the Hilbert space 3" in terms of y? in Section 2.2, 
given g e L 2 (dr) with ||p||L 2 (rfT) ^ 1 and f{x) = g(v)(p(v, x)dT(v), we aim at finding an 

element of fT close to /. We can also represent ^ through a similar decomposition, now with a finite 
number of features, i.e., through a € M” such that / = ') with norm^ ||/|||. ^ n,||a||| 

as small as possible and so that the following approximation error is also small: 


11 / - fWliidp) 


n 

i=l 


g{v)(p{v,-)dT{v) 


L2{dp) 


( 6 ) 


Note that with a* = -g{vi) and Vi sampled from dr (independently), then, we have E(||q;|| 2 ) = 
ELilEa? = ^ ^ and an expected error E(||/ - fWl^^ap)) = '^Whidp) ^ 

^ sup^gV ||(^(u, •)lli 2 (dp)i onr goal is to obtain an error rate with a better scaling in n, by (a) choos¬ 
ing a better distribution than dr for the points vi,... ,Vn and (b) by finding fhe besf possible weighfs 
a G (fhaf should of course depend on fhe function g). 


Goals. We fhus aim af sampling n poinfs vi,... ,Vn G V from a disfribufion wifh density q wifh 
respecf fo dr. Then fhe kernel approximafion using importance weights is equal fo 


k{x,y) = 

n ^ q{vi) 


ip{vi,x)p{vi,y) 


(so fhaf fhe 
minimizing 

norm of fhe 
as possible. 


aw of large numbers leads fo an approximafion converging fo k), and we fhus aim af 


Ta=i ■)-fv 9 {v)^{v, ■)dT{v) 


L2{dp) 


, wifh n\\ 


(which represenfs fhe 


approximafion in T because of our imporfance weighfs are faken info accounf) as small 


3. Quadrature in RKHSs 

Given a square-infegrable (wifh respecf fo dp) function p : X —)• R, fhe quadrafure problem aims af 
approximating, for all h infegrals 



h{x)g{x)dp{x) 


by linear combinations 

n 

'^aih{xi) 

i=l 

of evaluations h{xi),..., h[xn) of fhe function h af well-chosen poinfs xi,..., G X. Of course, 
coefficienfs a G R” are allowed fo depend on g (fhey will in linear fashion in fhe nexf secfion), buf 
nof on h, as fhe so-called quadrafure rule has fo be applied fo all functions in 3“. 

3. Note the factor n because our finite-dimensional kernel in Eq. (5) is an average of kernels and not a sum. 
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3.1 Approximation of the mean element 


Following Smola et al. (2007), the error may be expressed using the reproducing property as: 

n « , n « 

y^aih{xi)- / h{x)g{x)dp{x) = (h,y^aik{-,Xi) - / k{-,x)g{x)dp{x] 

Jx \ Jx 


and by Cauchy-Schwarz inequality its supremum over ||/i|| 3 r ^ 1 is equal to 


n 

'^aik{-,Xi) 

i=l 


k{-,x)g{x)dp{x) 


(V) 


The goal of quadrature rules formulated in a RKHS is thus to find points xi,... € X and 

weights a G M” so that the quantity in Eq. (7) is as small as possible (Smola et ah, 2007). For 
p = 1 , the function k(-, x)dp(x) is usually referred to as the mean element of the distribution dp. 

The standard Monte-Carlo solution is to consider xi,...,Xn sampled i.i.d. from dp and the weights 
ctj = g{xi)/n, which leads to a decrease of the error in with lE||a||2 ^ ^ and an expected 

squared error which is equal to ^E\\g{v)k{:,x)\\'^ ^ ^sup^-gx a:) (Smola et ah, 
2007). Note that when g = I, Eq. (7) corresponds to a particular metric between the distribution dp 
and its corresponding empirical distribution (Sriperumbudur et ah, 2010). 

In this paper, we explore sampling points Xi from a probability distribution on X with density q with 
respect to dp. Note that when 5 is a constant function, it is sometimes required that the coefficients a 
are non-negative and sum to a fixed constant (so that constant functions are exactly integrated). We 
will not pursue this here as our theoretical results do not accommodate such constraints (see, e.g., 
Chen et ah, 2010; Bach et ah, 2012, and references therein). 


Tolerance to noisy function values. In practice, independent (but not necessarily identically dis¬ 
tributed) noise Si may be present with variance cr^(xj). Then, the worst (with respect to ll/ilb ^ 1 ) 
expected (with respect to the noise) squared error is 


inf E 
hy^i 


10 p z 

y^ai{h{xi) + Ei) - / h{x)g{x)dp{x) 

7^1 

n p 2 'IT' 

y^aik{-,Xi) - / k{-,x)g{x)dp{x) -h Y] Q;^o-^(xi), 


and thus in order to be robust to noise, having a small weighted ^2-norm for the coefficients a G M” 
is important. 


3.2 Reformulation as random features 

For any x G X, the function k{-,x) is in X, and since we have assumed that Y}I‘^ is an isometry 
from L 2 {dp) to X, there exists a unique element, which we denote ?/)(•,x), of L 2 {dp) such that 
SV 2 ^(. , x) = k{-,x). Given the Mercer decomposition A:(-, x) = /rmem(x)em^ we have 
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^ /2 

the expansion = '^^>1 Gm{x)em (with convergence in the L 2 -norm for the measure 

1 /2 

dp ® dp', note that we do not assume that ppi is summable), and thus we may consider ■0 as a 
symmetric function. Note that 'll; may not be easy to compute in many practical cases (except for 
some periodic kernels on [0,1]). 

We thus have for (x, y) € X x X: 

k{x,y) = {k{-,x),k{-,y)) 3 r = (S^/^'0(-, x), 2/))s- = {'^P{-,x),'ilj{-,y))L 2 (dp) 

because of the isometry property of 

= / i^iv,x)'ilj{v,y)dp{v). (8) 

Jx 


That is, the kernel k may always be written as an expectation. Moreover, we have the quadrature 
error in Eq. (7) equal to (again using the isometry from L 2 {dp) to X): 


K.,x)g{x)dp{x) 


2=1 


n « 

1^1 

n « 

- / i}{x,-)g{x)dp{x) 


2=1 


L2{dp) 


which is exactly an instance of the approximation result in Eq. (6) with V = X and (p = ijj, that is 
the random feature is indexed by the same set X as the kernel. Thus, the quadrature problem, that 
is finding points Xj and weights (oj) to get the best possible error over all functions of the unit ball 
of 3“, is a subcase of the random feature problem for a specific expansion. Note that this random 
decomposition in terms of ifj is always possible (although not in closed form in general). 

Interpretation through square-roots of intergral operators. As shown in Section 2.2, random 
feature expansions correspond to square-roots of the integral operator S : L 2 {dp) —>• L 2 {dp) as 
S = TT*. Among the many possible square roots, the quadrature case corresponds exactly to 
the positive self-adjoint square root T = In this situation, the basis {fm)m^i of the singu¬ 

lar value decomposition of T = is equal to (em)m^i, recovering the expansion 'ilj{x,y) = 
e.m{x)em{v) which we have seen above. 

Translation-invariant kernels on [0,1]'^ or X = In this important situation, we have two 
different expansions: the one based on Eourier features, where the random variable indexing the 
one-dimensional feature is sl frequency, while for the one based on the square root ?/), the random 
variable is a spatial variable in X. As we show in Section 4, our results are independent of the cho¬ 
sen expansions and thus apply to both. However, (a) when the goal is to do quadrature, we need to 
use and (b) in general, the decomposition based on Eourier features can be easily computed once 
samples are obtained, while for most kernels, 'f{x, y) does not have any closed-form simple expres¬ 
sion. In Section 6, we provide a simple example with X = [0,1] where the two decompositions are 
considered. 


Goals. In order to be able to make the parallel with random feature approximations, we consider 
importance-weighted coefficients /3j = aiq(xi)^/^, and we thus aim at minimizing the approxima- 
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tion error 


y^diq{xi) ^/‘^k{-,Xi)- / k{-,x)g{x)dp{x) 

“T JX 


We consider potential independent noise with variance a‘^{xi) ^ T‘^q{xi) for all Xi, so that the 
tolerance to noise is characterized by the ^ 2 -iiorm ||/3||2- 

3.3 Relationship with column sampling 

The problem of quadrature is related to the problem of column sampling. Given n observations 
xi,...,Xn G 3C, the goal of column-sampling methods is to approximate the n x n matrix of 
pairwise kernel evalulations, the so-called kernel matrix, from a subset of its columns. It has ap¬ 
peared under many names: Nystrdm method (Williams and Seeger, 2001), sparse greedy approxi¬ 
mations (Smola and Schdlkopf, 2000), incomplete Cholesky decomposition (Fine and Scheinberg, 
2001), Gram-Schmidt orthonormalization (Shawe-Taylor and Cristianini, 2004) or CUR matrix de¬ 
compositions (Mahoney and Drineas, 2009). 

While column sampling has typically been analyzed for a fixed kernel matrix, it has a natural ex¬ 
tension which is related to quadrature problems: selecting n points xi,..., from X such that the 
projection of any element of the RKHS 3“ onto the subspace spanned by k{-,Xi), z = 1,..., n is as 
small as possible. Natural functions from T are k{-,x), x G X, and thus the goal is to minimize, for 
such X G X, 



2 = 1 


In the usual sampling approach, several points are considered for testing the projection error, and it 
is thus natural to consider the criterion averaged through the measure dp, that is: 



In fact, when dp is supported on a finite set, this formulation is equivalent to minimizing the nuclear 
norm between the kernel matrix and its low-rank approximation. There are thus several differences 
and similarities between recent work on column sampling (Bach, 2013; El Alaoui and Mahoney, 
2014) and the present paper on quadrature rules and random features: 

- Different error measures: The column sampling approach corresponds to a function g in 
Eq. (7) which is a Dirac function at the point x, and is thus not in L 2 {dp). Thus the two 
frameworks are not equivalent. 

- Approximation vs. prediction: The works by Bach (2013); El Alaoui and Mahoney (2014) 
aim at understanding when column sampling leads to no loss in predictive performance within 
a supervised learning framework, while the present paper looks at approximation properties, 
mostly regardless of any supervised leaiwing problem, except in Section 4.5 for random features 
(but not for quadrature). 

- Lower bounds: In Section 4.3, we provide explicit lower bounds of approximations, which are 
not available for column sampling. 
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- Similar sampling issues; In the two frameworks, points xi,..., £ X are sampled i.i.d. with 

a certain distribution q, and the best choice depends on the appropriate notion of leverage 
scores (Mahoney, 2011), while the standard uniform distribution leads to an inferior approxima¬ 
tion result. Moreover, the proof techniques are similar and based on concentration inequalities 
for operators, here in Hilbert spaces rather in finite dimensions. 


3.4 Related work on quadrature 

Many methods have been designed for the computation of integrals of a function given evaluations 
at certain well-chosen points, in most cases when g is constant equal to one. We review some of 
these below. 

Uni-dimensional integrals. When the underlying set X is a compact interval of the real line, 
several methods exists, such as the trapezoidal or Simpson’s rules, which are based on interpolation 
between the sample points, and for which the error decays as 0(l/n^) and 0(l/n^) for functions 
with uniformly bounded second or fourth derivatives (Cruz-Uribe and Neugebauer, 2002). 

Gaussian quadrature is another class of methods for one-dimensional integrals: it is based on a basis 
of orthogonal polynomials for L 2 {dp) where dp is a probability measure supported in an interval, 
and their zeros (Hildebrand, 1987, Chap. 8). This leads to quadrature rules which are exact for 
polynomials of degree 2n — 1 but error bounds for non-polynomials rely on high-order derivatives, 
although the empirical performance on functions of a Sobolev space in our experiments is as good 
as optimal quadrature schemes (see Section 6); depending on the orthogonal polynomials, we get 
various quadrature rules, such as Gauss-Legendre quadrature for the Lebesgue measure on [0,1]. 

Quasi Monte-carlo methods employ a sequence of points with low discrepancy with uniform weights 
(Morokoff and Caflisch, 1994), leading to approximation errors of 0(l/n) for univariate functions 
with bounded variation, but typically with no adaptation to smoother functions. 

Higher-dimensional integrals. All of the methods above may be generalized for products of 
intervals [0,1]'^, typically with d small. For larger problems, Bayes-Hermite quadrature (O’Hagan, 
1991) is essentially equivalent to the quadrature rules we study in this paper. 

Some of the quadrature rules are constrained to have positive weights with unit sum (so that the 
positivity properties of integrals are preserved and constants are exactly inegrated). The quadrature 
rules we present do not satisfy these constraints. If these constraints are required, kernel herd¬ 
ing (Chen et ah, 2010; Bach et ah, 2012) provides a novel way to select a sequence of points based 
on the conditional gradient algorithm, but with currently no convergence guarantees improving over 
0(1/V^) for infinite-dimensional spaces. 

Theoretical results. The best possible error for a quadrature rule with n points has been well- 
studied in several settings; see Novak (1988) for a comprehensive review. For example, for X = 
[0,1] and the space of Sobolev functions, which are RKHSs with eigenvalues of their integral op¬ 
erator decreasing as Novak (1988, Prop. 2 and 3, page 38) shows that the best possible 

quadrature rule for the uniform distribution and g = 1 leads to an error rate of n“®, as well as for 
any squared-integrable function g. The proof of these results (both upper and lower bounds) relies 
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on detailed properties of Sobolev spaces. In this paper, we recover these results using only the de¬ 
cay of eigenvalues of the associated integral operator S, thus allowing straightforward extensions to 
many situations, like Sobolev spaces on manifolds such as hyperspheres (Hesse, 2006), where we 
also recover existing results (up to logarithmic terms). 

Moreover, Novak (1988, page 17) shows that adaptive quadrature rules where points are selected 
sequentially with the knowledge of the function values at previous points cannot improve the worst- 
case guarantees. Our results do not recover this lower bound result for adaptivity. 

Finally, Langberg and Schulman (2010) consider multiplicative errors in computing integrals and 
mainly focuses on different function spaces, such as ones used in clustering functionals. Although 
sampling quadrature points from a well-chosen density is common in the two approaches, the anal¬ 
ysis tools are different. It would be interesting to see if some of these tools can be transferred to our 
RKHS setting. 


From quadrature to function approximation and optimization. The problem of quadrature, 
uniformly over all functions g € L 2 {dp) that define the integral, is in fact equivalent to the full 
approximation of a function h given values at n points, where the approximation error is character¬ 
ized in L 2 -norm. Indeed, given the observations h{xi), f = 1,..., n, we build ctih{xi) as an 
approximation of g(x)h(x)dp(x). It turns out that the coefficients are linear in g, that is, there 
exists tti G L 2 (dp) such that a* = {at, g)L 2 {dp)- This implies that Ya=i h{xi){ai,9)L2{dp) is an 
approximation of {h, g)L 2 (dp)- Thus, the worst case error with respect to g in the unit ball of L 2 {dp) 
is II d{xi)ai — /i||^ 2 (dp)’ approximation result of h through observations 

of its values at certain points. 

Novak (1988) considers the approximation problem in Loo-norm and shows that for Sobolev spaces, 
going from L 2 - to Lqo- norms incurs a loss of performance of ^/n. We recover partially these results 
in Section 5 from a more general perspective. When optimizing the points at which the function 
is evaluated (adaptively or not), the approximation problem is often referred to as experimental 
design (Cochran and Cox, 1957; Chaloner and Verdinelli, 1995). 

Finally, a third problem is of interest (and outside of the scope of this paper), namely the problem 
of finding fhe minimum of a funcfion given (pofenfially noisy) function evaluations. For noiseless 
problems, Novak (1988, page 26) shows that the approximation and optimization problems have 
the same worst-case guarantees (with no influence of adapfivify); fhis opfimizafion problem has 
also been sfudied in fhe bandit setting (Srinivas et al., 2012) and in the framework of “Bayesian 
optimization” (see, e.g. Bull, 2011). 


4. Theoretical Analysis 

In this section, we provide approximation bounds for the random feature problem outlined in Sec¬ 
tion 2.4 (and thus the quadrature problem in Section 3). In Section 4.1, we provide generic upper 
bounds, which depend on the eigenvalues of the integral operator S and present matching lower 
bounds (up to logarithmic terms) in Section 4.3. The upper-bound depends on specific disfribu- 
fions of samples fhaf we discuss in Section 4.2. We then consider consequences of these results on 
quadrature (Section 4.4) and random feature expansions (Section 4.5). 
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4.1 Upper bound 


The following proposition (see proof in Appendix B.l) determines the minimal number of samples 
required for a given approximation accuracy: 


Proposition 1 (Approximation of the unit ball of 30 For A > 0 and a distribution with positive 
density q with respect to dr, we consider 

<imax(g,A) =sup-^((/?(u, •),(S + A/)~V(''^r))L2(dp)- (9) 

fSV Q\0) 

Let vi,... ,Vn be sampled i. i. d. from the density q, then for any 5 € (0,1), if 

\ f 16dmax(^, A) 

n ^ 5dmax(Q', A) log - - - , 

0 

with probability greater than l — 5,we have A Ollijldp) ^ 

n 2 


sup 



i=l 


^ 4A. 

L2{dp) 


We can interpret the proposition above as follows: given any squared error 4A > 0 and a distribution 
with density q, the number n of samples from q needed so that the unit ball of 3“ is approximated by 
the ball of radius 2 of 3“ is, up to logarithmic terms, at most a constant times dmax(^) A), defined in 
Eq. (9). The result above is a statement for a fixed q and A and fhis number of samples n depends 
on fhese. 

We could also inverf fhe relationship befween A and n, fhaf is, answer fhe following question: given 
a fixed number n of samples, whaf is fhe approximation error A? This requires inverting fhe funcfion 
A I—)• dmax(<?j A). This will be done in Secfion 4.2 for a specific disfribufion q where fhe expression 
simplifies, fogefher wifh specific examples from Section 2.3. 

Finally, nofe fhaf we also have a bound on A ')\\L^{^dpy which shows fhaf our 

random functions are nof foo large on average (fhis consfrainf will be needed in fhe lower bound as 
well in Secfion 4.3). 


Sketch of proof. The proof fechnique relies on compufing an explicif candidafe /3 G M" obfained 
from minimizing a regularized leasf-squares formulation 


inf II y] Piq{vi) •) - / 


2 = 1 


La (dp) 


+ n. 


If fums ouf fhaf fhe final bound on fhe squared error is exacfly proporfional fo fhe regularizafion 
paramefer A. As shown in Appendix B.l, fhis leads fo an approximation / which is a linear function 
of /, as / = (S + A/)“^S/, where S is a properly defined empirical infegral operator and A > 0 
is fhe regularizafion paramefer. Then, Bernsfein concenfrafion inequalifies for operators (Minsker, 
2011) can be used in a way similar fo fhe work of Bach (2013); El Alaoui and Mahoney (2014) on 
column sampling, fo provide a bound on all desired quanfifies. 
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Result in expectation. In Section 4.5, we will need a result in expectation. As shown at the end 

of Appendix B.l, as soons as, A ^ (tr Il)/4 and n ^ 5(iinax(A) log ^ 

A 


E( sup 


inf 








i=l 


L2(dp) 


^ 8A. 


4.2 Optimized distribution 

We may now consider a specific distribution that depends on the kernel and on A, namely 

w , ^ •), (S + A/)-V(u, ■))L2{dp) ^ (y?(u, •), (S + AJ)-V(u, ■))L2{dp) 

^ •), (S + \I)-^ip{v, ■))L 2 {dp)dT{v) tr S(S + A/)-i 

for which dmax('?A5"^) = d{X) = trS(S + A/)~^. With this distribution, we thus need to have 
n ^ 5d(A) log with d{\) = tr S(S + A/)“^ is the degrees of freedom, a traditional quantity 
in the analysis of least-squares regression (Hastie and Tibshirani, 1990; Caponnetto and De Vito, 
2007), which is always smaller than dmax(l) A) and can be upper-bounded explicitly for many 
examples, as we now explain. The computation of dmax(l) A) in the operator setting (for which 
we may use q = 1), a quantity often referred to as the maximal leverage score (Mahoney, 2011), 
remains an open problem. 

The quantity d{X) only depends on the integral operator S, that is, for all possible choices of square 
roots, i.e., all possible choices of feature expansions, the number of samples that our results guar¬ 
antee is the same. This being said, some expansions may be more computationally practical than 
others, and when using the distribution with q{v) = 1, the bounds will be different. 

Expression in terms of singular value decomposition. Given the singular value decomposition 

1 /2 

of ip in Eq. (3), we have, for any u G V, ip{v, •) = XlmSsi dm fm{v)em and thus 
ql{v) CX {ip{v, •), (S -f XI)-^ip{v, ■))L 2 (dp) = X] 

which provides an explicit expression for the density q^ 

For a given squared error value A, the optimized distribution while leading to the degrees of 
freedom that will happen to be optimal in terms of approximation, has two main drawbacks: 

- Dependence on A: this implies that if we want a reduced error (i.e., a smaller A), then the 
samples obtained from a higher A, may not be reused to provably obtain the desired bound; in 
other words, the sampling is not anytime. For specific examples, e.g., quadrature with periodic 
kernels on [0,1] with the uniform distribution, then g = 1 happens to be optimal for all A, and 
thus, we may reuse samples for different values of the error. 

- Hard to compute in practice: the optimal distribution depends on a leverage score {ip{v, •), (S-|- 
XI)~^ip{v, •))i/ 2 (rfp)’ which may be hard to use for several reasons; first, it requires access to the 
infinite-dimensional operator S, which may be difficult; moreover, even if it possible to invert 
S -I- XI, the set V might be particularly large and impractical to sample from. At the end of 
Section 4.1 , we propose a simple algorithm based on sampling. 
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Eigenvalues and degrees of freedom. In order to relate more directly to the eigenvalues of S, we 
notice that we may lower bound the degrees of freedom by a constant times the number m* (A) of 
eigenvalues greater than A: 

d{\) = tr S(S + A/)”^ = ^ ^ ^ ^max({m, n^n > A}) = m*(A), 

^^1 + X //m + A 2 

as defined in Section 2.1. 

Moreover, we have the upper-bound: 

dW = ^ ' + X] ^ max({m, + \ 

fim + A ^ Aim + A A ^ 

We now make the assumption that there exists a7 > 0 independent of j such that 

CX) 

Vj ^1, ^ Aim ^ TJAij- (11) 

m=j 

This assumption essentially states that the eigenvalues decay sufficiently homogeneously and is 
satisfied by Atm oc wifh 7 = (2a — 1)“^, Hm oc r”* wifh 7 = (1 — r)“^ and similar bounds 

also hold for all examples in Secfion 2.3. If allows us fo relate fhe degrees of freedom direcfly fo 
eigenvalue decays. 

Indeed, fhis implies fhaf ^ 7max({m, A^m ^ A}) = m*(A) for all A ^ /ii (fhe 

largesl eigenvalue) and fhus 

^m*(A) ^ d ^ [1 + 7]m*(A). 

We can now resfafe fhe approximafion resulf of Prop. 1 from Secfion 4.1 wifh fhe opfimized disfri- 
bufion (see proof in Appendix B.2): 


Proposition 2 (Approximation of the unit ball of T for optimized distribution) For A > 0 and 

the distribution with density q\ defined in Eq. (10) with respect to dr, with degrees of freedom d{X). 
Let vi,... ,Vn be sampled i.i.d.from the density q, defining the kernel (and its associated RKHS 3") 
kx,y) = p(vi,x)p(vi,y). Then, for any 6 E (0,1), with probability 1 — 6, we have: 


sup inf 

Wfy^l ll/llg.<2 


11/-/Him 


< 4A, 


under any of the following conditions: 


(a) j/n ^ 5 d(A) log [l6d(A)/5], 

(b) ifEq. (11) is satisfied, and, by choosing m ^ and A = pm- 

o(l+7j log 


The statement (a) above, is a simple corollary of Prop. 1, and goes from level of error A to minimum 
number n of samples. The statement (b) goes in the other direction, that is, from the number of 
samples n to the achieved approximation error. It depends on the eigenvalues pm of the integral 
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operator taken at m = 0(n/log(n)). For example, for polynomial decays of eigenvalues of the 
form /im = we get (non squared) errors proportional to (logn)^n“® for n samples, 

while for geometric decays, we get geometric errors as a function of the number n of samples. 

Note however that for the statement (b) to hold, we need to sample the points vi,... ,Vn from 
the distribution that is, for different numbers of samples n, the distribution is unfortunately 
different (except in special cases). It would be interesting to study the properties of independent 
but not identically distributed samples vi,... ,Vn and the possibility of achieving the same rate 
adaptively. 

Corollary for Sobolev spaces. For the sake of concreteness, we consider the special case of X = 
and translation-invariant kernels. We assume that the distribution dp is sub-Gaussian. Then for 
Sobolev spaces of order s, the eigenvalue decay is proportional to Thus, if we can sample 

from the optimized distribution, after n random features, we obtain an approximation of the unit ball 
of T with error independently of the chosen expansion, the spatial one used for quadrature 

or the spectral one used in random Fourier features. For kernels in W^, these distributions are not 
readily computed in closed form and need to computed through a dedicated algorithm such as the 
one we present below. 

The same approximation results holds for translation-invariant kernels on [0, l]'^; but when dp is the 
uniform distribution, as shown in Section 4.4, the optimized distribution for the quadrature case is 
still the uniform distribution, for all values of A, and can thus be computed. 

Algorithm to estimate the optimized distribution. We now consider a simple algorithm for 
estimating the optimized distribution It is based on using a large number N of points vi,... ,vn 
from dr, and replacing dr by a potentially weighted empirical distribution df associated with these 
N points. Therefore, we may use any set of points and weights, which leads to a distribution close 
to dr. In full generality, only random samples from dr are readily available (with weights 1/A^), 
but for special cases, such as V = [0,1] or V = N*, we may use deterministic representations. See 
examples in Section 6. 

We thus assume that we have pairs {vi, Pi) G VxM_|_, i = 1,..., N, such that Vi = 1- Since 
df has a finite support with at most N elements, we may identify L 2 {df) and (with its canonical 
dot-product), and the operator T goes now from to L 2 {dp), with Tg = gip{vi, •) G 

L 2 {dp), with T6i = ■) G L 2 {dp), for Si the i-th element of the canonical basis of M^. 

Then, we have: 

{ipivi,■),i^ + ^Ir"f’{v^r))L,idp) = vy{TS,,{TT* + XI)-^TSi)L,^^,) 

= py{T6^,T{T*T + XI)-^6,)L,idp) 

= py{T*T{T*T + XI)-^)^^. 

This implies that the density of the optimized distribution with respect to the uniform measure on 
{vi,... ,vj\f} is proportional to (x*T{T*T + XI)~^)... We can then sample any number n of 
points from resampling from {r ;!,... ,vj\f} from the density above. The computational complexity 
is 0{N^). A detailed analysis of the approximation properties of this algorithm is outside the scope 
of this paper. 
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We have {T*T)ij = J^ip{vi,x)ip{vj,x)dp{x). In some cases, it can be computed in closed 

form—such as for quadrature where this is equal to k{vi,Vj). In some others, it requires 

i.i.d. samples xi,..., xm from dp, and the estimate: Y^^=i ^{vi,Xk)p{vj,Xk). 


4.3 Lower bound 


In this section, we aim at providing lower-bounds on the number of samples required for a given 
accuracy. We have the following result (see proof in Appendix B.3): 


Proposition 3 (Lower approximation bound) For 6 G (0,1), if we have a family V'l, ■ ■ ■, 'i/’n £ 
L 2 {dp) such that 


1 X ^ 

^ 2trS/<5, 


and 


sup inf 

ii/iixo 


f AV'i 

i=l 


2 

^ 4A, 

L2{dp) 


then n ^ 


max{m, pm ^ 144A} 


4 log 


10 tr E 
A5 


We can make the following observations: 


- The proof technique not surprisingly borrows tools from minimax estimation over ellipsoids, 
namely the Varshamov-Gilbert’s lemma. 

- We obtain matching upper and lower bounds up to logarithmic terms, using only the decay 
of eigenvalues {pm)m^i of the integral operator S (of course, if sampling from the optimized 
distribution is possible). Indeed in that case, as shown in Prop. 2, we have shown that we 
need at most 10 d{X) log [2(i(A)], where d(A) is the degrees of freedom, which is upper and 
lower bounded by a constant times m*{X) = max{m, pm ^ A}. 

- In order to obtain such a bound, we need to constrain both ||/3||2 and the norms of the vectors r/)*, 
which correspond to bounded features for the random feature interpretation and tolerance to 
noise for the quadrature interpretation. We choose our scaling to match the constraints we have 
in Prop. 1 , for which the parameter 6 ends up entering the lower bound logarithmically. 


4.4 Quadrature 

We may specialize the results above to the quadrature case, namely give a formulation where the 
features ip do not appear (or equivalently using defined in Section 3.2). This is a special case 
where V = X and p = ijj. In terms of operators T in Section 2.2, this corresponds to T = 

Optimized distribution. Following Section 4.1, we have an expression for the optimized distri¬ 
bution, both in terms of operators, as follows, 

ql{x) cx (t/>(x, •), (S -h A/)"V(a;, ■))L 2 {dp) = •), (S -f A/)"^S"fr2A:(x, ■))L 2 (dp), 

and in terms of eigenvalues and eigenvectors of k, that is, 

q{x) oc {k{-,x), S"fr2(S -h A/)"^S"fr2/c(-, x))i 2 (rfp) = ^ , em{x)^- (12) 
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While this is uniform in some special cases (uniform distribution on [0,1] and Sobolev kernels, as 
shown below), this is typically hard to compute and sample from. An algorithm for approximating 
it was presented at the end of Section 4.1. 

A weakness of our result is that in general our optimized distribution depends on A and thus 
on the number of samples. In some cases with symmetries (i.e., uniform distribution on [0,1] or 
the hypersphere), happens to be constant for all A. Note also that we have observed empirically 
that in some cases, converges to a certain distribution when A tends to zero (see an example in 
Section 6). 

Sobolev spaces. For Sobolev spaces with order s in [0, l]'^ or (for which we assume d < 2 s), 
the decay of eigenvalues is of the form and thus the error after n samples is (up to 

logarithmic terms), which recovers the upper and lower bounds of Novak (1988, pages 37 and 38) 
(also up to logarithmic terms). 

For the special case of Sobolev spaces on [0,1]'^ with dp the uniform distribution, the optimized 
distribution in Eq. (12) is also the uniform distribution. Indeed, the eigenfunctions of the integral 
operator S are d-th order tensor products of the uni-dimensional Fourier basis (the constant and 
all pairs of sine/cosine at a given frequency), with the same eigenvalue for the 2'^ possibilities of 
sines/cosines for a given multi-dimensional frequency (mi,... ,md)- Therefore, when summing 
all squared values of the eigenfunctions corresponding to (mi,. .., nid), we end up with the sum 
X^aeloql-^ ni=i cos^““ (27rmiXi) sin^(^ “')(27rmiXj), which ends up being constant equal to one 
(and thus independent of x) because cos^“*(27rmiXj) -|- sin^“*(27rmiXj) = 1. 

Finally, we may consider Sobolev spaces on the hypersphere, with the kernels presented in Sec¬ 
tion 2.3. As shown by Bach (2014, Appendix D.3), the kernel k{x,y) = for v 

uniform on the hypersphere, leads to a Sobolev space of order t = s + while the decay of 
eigenvalue of the integral operator was shown to be in Section 2.3. It is thus equal to 

and we recover the result from Hesse (2006). 


Quadrature rule. We assume that points xi,..., are sampled from the distribution with den¬ 
sity q with respect to dp. The quadrature rule for a function /i £ T is qlxip ^^ ' compute /3, 
we need to minimize with respect to /3 the error: 


E 



k{-,x)g{x)dp{x) 


2 

+ n\\\l3\\l, 




which is the regularized worst case squared error in the estimation of the integral of h over /i £ T. 
The best error is obtained for A = 0, but our guarantees are valid for A > 0, with an explicit control 
over the norm ||/3|||, which is important for robustness to noise. 

Given the values of k(xi, x)g(x)dp(x} = Zi, for i = 1,, n, which can be computed in closed 
form for several triplet {k, g, dp) (see, e.g., Smola et ah, 2007; Oates and Girolami, 2015), then the 
problem above is equivalent to minimizing with respect to (d: 


EE 


q(Xi)V2g,(a;j)l/2 


n 

k{xi,Xj) - ^ 
i=l 


A 

g(Xi)V2 


Zi + nX\\/3\\ 


2 

2 ) 
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which leads to a n x n linear system with running time complexity O(n^). Note that when adding 
points sequentially (in particular for kernels for which the distribution is independent of A, such 
as Sobolev spaces on [0,1]), one may update the solution so that after n steps, the overall complexity 
is 0{n^). 


Approximation of functions in S'. With the quadrature weights /3 estimated above and the quadra¬ 
ture rule qlx^y ^/2 for th® estimation of g{x)f{x)dp{x), we may derive an expression which 
is explicitly linear in g. Following the proof of Prop. 1 in Appendix B.l, we have, when specialized 
to the quadrature case: 

1 . ^ 1 / 1 ^ ^ 1 \ 


Moreover, we have /?* = „g(^^.)i /2 (fc(^ a:^), S + ^I) ^J:^/^g)L^^dp) from Eq. (15) in Ap¬ 

pendix B.l, and the quadrature rule becomes: 


Pih{xi) 

^ q(xi)V2 


/ 1 1 \ 

{h,-Y, ^-^^[k{xi, •) k{x„ •)] S-V2(s + XI)-^J:^/^g ) 

\ ri q(Xij / 


which can be put in the form {h,g)i^(^dp) with the approximation h = S^/^S(S -|- \I) 
of the function h ^ 3^. Having a bound for all functions g such that ||fl'||L 2 (dp) ^ 1 is equivalent to 
having a bound on ||/i — /i 11^2(dp)- Iri Section 5, we consider extensions, where we consider other 
norms than the L 2 -norm for characterizing the approximation error h — h. Moreover, we consider 
cases where h belongs to a strict subspace of 3 (with improved results). 


4.5 Learning with random features 


We consider supervised learning with m i.i.d. samples from a distribution on inputs/outputs (x, y), 
and a uniformly G-Lipschitz-continuous loss function £{y,-), which includes logistic regression and 
the support vector machine. We consider the empirical risk L{f) = — YllLi the 

expected risk L{f) = E£(y, f{x)), with x having the marginal distribution dp that we consider in 
earlier sections. We assume that Kk{x,x) = tr S = R^. We have the usual generalization bound 
for the minimizer / of L{f) with respect to ||/|| j ^ F, based on Rademacher complexity (see, e.g., 
Shalev-Shwartz and Ben-David, 2014): 


Km] 


^ inf L(f) + 2E 


sup |L(/)-L(/)| 






inf 

WfWx^F 


m + 


4FGR 

y/m 


(13) 


We now consider learning by sampling n features from the optimized distribution from Section 4.2, 
leading to a function parameterized by /3 e M”', that is gp = ') S L 2 {dp). 


22 












Applying results from Section 4.1 , we assume that A ^ ii2/4 and n ^ 5d{X) log where 

d{X) is equal to the degrees of freedom associated with the kernel k and distribution dp. Thus, the 
expected squared error for approximating the unit-ball of 3“ by the ball of radius 2 of the approxi¬ 
mation T obtained from the approximated kernel is less than 8 A. 

If we consider the estimator j3 obtained by minimizing the empirical risk of gp subject to ||/ 3||2 ^ 
2F!y/n. We have the following decomposition of the error for any 7 e such that II 7 II 2 ^ 
2Fjy/n and / G T such that ||/||5 ^ F\ 

Hap) = 


~ + L{g^) — L{g.y) -|- L{g^) — L{g^) + L{g^) — L{f) -|- L{f) 

2 sup \L{gp^) - L{gpi)\ + [T(p^) - L(/)] -f L{f) 

'-||/3'||g-72F/V7 J 


sup \L{gpf) - L{gi 3 i)\ 

||gr72F/7n 


-h sup inf \L{gy) — L{f)] + inf L{f). 

||/'||^7Fll7ll2<2F/77L Wfy^F 


We now take expectation with respect to the data and the random features. Following standard 
results for Rademacher complexities of ^ 2 -balls (Bartlett and Mendelson, 2003, Lemma 22), the 
first term is less than 


m^/n q{vi) ^ ^ 

'' 2 = 1 J = 1 ^ ' 


4FG 


mwn 


{nm tr 


4:FGR 

y/m 


Because of the G-Lipschitz-continuity of the loss, we have L{gj) — L{f') ^ G\\gj) — f'\\L 2 {dp)^ 
and thus the second term is less than y/^GF ^ ‘iGFy/X. Overall, we obtain 

^ AFGE 

]E[L( 5 ,)] ^ inf L(/) + 3 GFx^+^^. 

L P J II/IL7F y/m 

If we consider X = E?/mm order to lose only a constant factor compared to Eq. (13), we have the 
constraint n ^ f>d{B?‘/m) log \2md{B?‘/m)\. 

We may now look at several situations. In the worst case, where the decay of eigenvalue is not fast, 
i.e., very close to 1/z, then we may only use the bound d( A) = tr S(S-|-A/)“^ ^ A“^trS = B?/X, 
and thus a sufficient condition n ^ 10m log 2m, and we obtain the same result as Rahimi and Recht 
(2009). 

However, when we have eigenvalue decays as we get (up to constants), following the 

same computation as Section 4.2, d{X) ^ and thus n ^ logm, which is a 

significant improvement (regardless of the value of F). Moreover, if the decay is geometric as r®, 
then we get d{X) ^ log(ii^/A), and thus n ^ (log m)^, which is even more significant. 


5. Quadrature-related Extensions 

In Section 4.4, we have built an approximation h = -|- AI)“^S“^/^/i of a function h G 3^, 

which is based on n function evaluations h{xi), ... , h{xn). We have presented in Section 4.4 a 
convergence rate for the L 2 -norm \\h — /i||L2(dp) for functions h with less than unit T-norm ||/i|| j ^ 
1. Up to logarithmic terms, if using the optimal distribution for sampling xi,..., Xn, then we get a 
squared error of pn where pn is the n-th largest eigenvalue of the integral operator S. 
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Robustness to noise. We have seen that if the noise in the function evaluations h{xi) has a vari¬ 
ance less than q{xi)T^, then the error \\h — additional term r^||/3||2 ^ Hence, 

the amount of noise has to be less than in order to incur no loss in performance (a bound which 
decreases with n). 

Adaptivity to smoother functions. We assume that the function h happens to be smoother than 
what is sufficient to be an element of the RKHS 3", that is, if ||S“®/i|| 2 , 2 (dp) ^ 1, where s ^ 1/2. 

The case s = 1/2 corresponds to being in the RKHS. In the proof of Prop. 1 in Appendix B.l, we 
have seen that with high-probability we have: 

(S + A/)-^ ^ 4(S + A/)-^ (14) 

We now see that we can bound the error \\h — (illLjfdp) follows: 

^ A||si/2(S + a/)-i/2||^j|(s + 

We may now bound each term. The first one is less than 2, because of 

Eq. (14). The second one ||(S-|-A/)“^/^S“^/^'''*||^pisequalto ||(S-|-A/)^“^(S-|-A/)^/^“*S“^/^'''^||^p, 
and thus less than ||(S -I- A/)^“^||op • ||(S -|-^ 2A^“^. Overall we obtain 

ll^-/i||L.(rfp)^4A^ 

The norm h i-)- ||S“^/i||x, 2 (rfp) RKHS norm with kernel J2m^o with corre¬ 
sponding eigenvalues equal to From Prop. 2 and 3, the optimal number of quadrature points 

to reach a squared error less than e is proportional to the number max({m, ^ e}), while using 
the quadrature points from s = 1/2, leads to a number max({m, ^ which is equal. 

Thus if the RKHS used to compute the quadrature weights is a bit too large (but not too large, see 
experiments in Section 6), then we still get the optimal rate. Note that this robustness is only shown 
for the regularized estimation of the quadrature coefficients (in our simulations, the non-regularized 
ones also exhibit the same behavior). 

Approximation with stronger norms. We may consider characterizing the difference h — h with 
different norms than || • \\L 2 {dp)^ iri particular norms ||S“’'(/i — /i)||i, 2 (dp)> with r G [0,1/2]. For 
r = 0, this is our results in L 2 -norm, while for r = 1/2, this is the RKHS norms. We have, using 
the same manipulations than above: 

< A^/2-r-||5.1/2-.(^ ^ ^ 2A^/2-r-_ 

When r = 1/2, we get a result in the RKHS norm, but with no decay to zero; the RKHS norm || • || j 
would allow a control in Loo-norm, but as noticed by Steinwart et al. (2009); Mendelson and Neeman 
(2010), such a control may be obtained in practice with r much smaller. For example, when the 
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eigenfunctions are uniformly bounded in Loo-norm by a constant C (as is the case for periodic 
kernels in [0,1] with the uniform distribution), then, for any x G X, we have for t > I, 

oo oo 2 

m=l m=0 


If for simplicity, we assume that = (m+1) (like for Sobolev spaces), we have ||S ^f\\‘j^^(^dp) ' 
Em=i '■(/> em)i 2 (rfp) = ^m)\^{^dp) = f/4s. If A ^ 0(n-2") (as sug¬ 
gested by Prop. 1), then we obtain a squared Loo-error less than = 

O (. With f = 1 -|- we get O (. and thus a degradation compared to the squared 
L 2 -I 0 SS of n (plus additional logarithmic terms), which corresponds to the (non-improvable) result 
of Novak (1988, page 36). 


6. Simulations 

In this section, we consider simple illustrative quadrature experiments"^ with X = [0,1] and kernels 
k{x, y) = 1 + 2TTm{x — y), with various values of s and distributions dp which are 

Beta random variable with the two parameters equal to a = 6 , hence symmetric around 1/2. 

Uniform distribution. For 6 = 1, we have the uniform distribution on [0,1] for which the co¬ 
sine/sine basis is orthonormal, and the optimized distribution q*^ is also uniform. Moreover, we 
have fg k(x, y)dp(x) = 1. We report results comparing different Sobolev spaces for testing func¬ 
tions to integrate (parameterized by s) and learning quadrature weights (parameterized by t) in 
Figure 1, where we compute errors averaged over 1000 draws. We did not use regularization to 
compute quadrature weights a. We can make the following observations: 

- The exponents in the convergence rates for s = f (matching RKHSs for learning quadrature 
weights and testing functions) are close to 2 s as expected. 

- When the functions to integrate are less smooth than the ones used for learning quadrature 
weights (that is f > s), then the quadrature performance does not necessarily decay with the 
number of samples. 

- On the contrary, when s > t, then we have convergence and the rate is potentially worse than 
the optimal one (attained for s = t), and equal when t ^ s/2, as shown in Section 5. 

In Figure 2, we compare several quadrature rules on [0,1], namely Simpson’s rule with uniformly 
spread points, Gauss-Legendre quadrature and the Sobol sequence with uniform weights. For s = 1, 
as expected, all squared errors decay as with a worse constant for our kernel-based rule, while 
for s = 2 (smoother test functions), the Sobol sequence is not adaptive, while all others are adaptive 
and get convergence rates around n“^. 

4. Matlab code for all 5 figures may be downloaded from http ; //www. di . ens . f r / - fbach/quadrature . html. 
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Test functions : s = 1 


Test functions : s = 2 



Test functions : s = 3 





Figure 1: Quadrature for functions in a Sobolev space with parameter s (four possible values) for 
the uniform distribution on [0,1], with quadrature rules obtained from different Sobolev 
spaces with parameters t (same four possible values). We compute affine fits in log-log- 
space (in dotted) to estimate convergence rates of the form C/n^ and report the value of 
u. Best seen in color. 
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Test functions : s = 1 


Test functions : s = 2 



0 0.5 1 1.5 2 

log^o(n) 


0 0.5 1 1.5 2 

log^o(n) 


Figure 2: Quadrature for functions in a Sobolev space with parameters s = 1 (left) and s = 2 
(right), for the uniform distribution on [0,1], with various quadrature rules. We compute 
affine fits in log-log-space (in dotted) to estimate convergence rates of the form and 
report the value of u. Best seen in color. 


Non-uniform distribution. We consider the case a = 6 = 1/2, which is the distribution dp 
with density with respect to the Lebesgue measure, and with cumulative 

distribution function F{x) = 7r“^ arccos(l — 2x). We may use an approximation of dr with N 
unweighted points F~^{k/N) = (l — cos /2, for A: G {1,..., N} and the algorithms from the 
end of Section 4.2. We consider the Sobolev kernel with s = 1. 

In Figure 3, we plot all densities as a function of A. When A is large, we unsuprisingly obtain the 
uniform density, while, more surprisingly, when A tends to zero, the density tends to a density, which 
happens here to be proportional to x^/^{l — x)^/^ (leading to a Beta distribution with parameters 
a = h = .25). 

We may also consider the same kernel but with the Fourier expansion on N. This is done by repre¬ 
senting dr oc 5o + Yhk&z* truncating to all |A;| ^ K, with K = 50, which is a weighted 

representation. We plot in Figure 4 the optimal density over the set of integers, both with respect to 
the input density (which decays as 1/n?) and the counting measure. When A is large, we recover 
the input density, while when A tends to zero, q\ tends to be uniform (and thus, does not converge 
to a finite measure). 


7. Conclusion 

In this paper, we have shown that kernel-based quadrature rules are a special case of random feature 
expansions for positive definite kernels, and derived upper and lower bounds on approximations, 
that match up to logarithmic terms. For quadrature, this leads to widely applicable results while 
for random features this allows a significantly improved guarantee within a supervised learning 
framework. 
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Figure 3: Optimal log-densities g^(x) (with respect to the input distribution) for several values of 
A, for the expansion used for quadrature. Best seen in color. 



Figure 4: Optimal densities q^{k) for several values of A, for Fourier feature expansions. Left: with 
respect to the input distribution (which itself has distribution proportional to l/k‘^ with 
respect to the counting measure); right: with respect to the counting measure. Best seen 
in color. 
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The present work could be extended in a variety of ways, for example towards bandit optimization 
rather than quadrature (Srinivas et ah, 2012), the use of quasi-random sampling within our frame¬ 
work in the spirit of Yang et al. (2014); Oates and Girolami (2015), a similar analysis for kernel 
herding (Chen et ah, 2010; Bach et ah, 2012), an extension to fast rates for non-parametric least- 
squares regression (Hsu et ah, 2014) but with an improved computational complexity, and a study 
of the consequences of our improved approximation result for online learning and stochastic ap¬ 
proximation, in the spirit of Dai et al. (2014); Dieuleveut and Bach (2014). 
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Appendix A. Kernels on product spaces 

In this appendix, we consider sets X which are products of several simple sets Xi,..., Xd, with 
known kernels ki, ... ,kd, each with RKHS 3“i,..., Xd- We also assume that we have d measures 
dpi, ..., dpd, leading to sequences of eigenvalues {pjmj)mj^i and eigenfunctions {ejmj)mj^i- 

Our aim is to define a kernel A: on X = Xi x • • • x with the product measure dp = dpi ■ ■ ■ dpd- 
For illustration purposes, we consider decays of the form pm oc for the d kernels, that will 

be useful for Sobolev spaces. We also consider the case where oc exp(—pm). For some 
combinations, eigenvalue decay is the most natural, in others, the number of eigenvalues m*{X) 
greater than a given A > 0 is more natural. 

A.l Sum of kernels: k{x,y) = Yl'j=if^ji^j^yj) 

In this situation, the RKHS for k is isomorphic to Xi x • • • x Xd, composed of functions g such 
that there exists /i,..., /^ in Xi,..., such that g{x) = that is we obtain separable 

functions, which are sometimes used in the context of generalized additive models (Hastie and Tibshirani, 
1990). The corresponding integral operator is then block-diagonal with j-th block equal to the in¬ 
tegral operator for kj and dpj. This implies that that its eigenvalues are the concatenation of all 
sequences {pjmj)mj^o- Thus the function m*(A) is the sum of functions m^(A) + • • • + m^(A). 

In terms of norms of functions, we have a norm equal to ||p||| = J2j=i II/? II j • 

In the particular case where pjmj oc for all j, or equivalently, a number of eigenvalues of 

kj greater than A proportional to we have a number of eigenvalues of k greater than A 

equivalent to that is a decay for the eigenvalues proportional to {m/d)~‘^^. Similarly, 

when the decay is exponential as exp(—pm), we get a decay of exp(—pm/d). 


A.l Product of kernels: k{x,y) = Y['j=i^ji^j^yj) 


In this situation, the RKHS for K is exactly the tensor product of Xi,..., Xd, i.e., the span of all 
functions 0^=1 fji^j)’ for /i,..., /d in Xi,..., X^ (Berlinet and Thomas-Agnan, 2004). More¬ 
over, the integral operator for A: is a tensor product of the d integral operator for A:i,..., A:^. This 
implies that its eigenvalues are pimi x • • • x pdmd’ • • • > ^ 0- Iii terms of norms of functions 

defined on Xi x • • • x X^, this thus corresponds to 

( d ^ -I / d \ 2 


d 

f,fle 

i=i 


jm. 


{Xj 


/ L2{dp®^) 


Special cases. In the particular case where pjmj oc m- for all j, we have a number of eigenval¬ 
ues of k greater than A equivalent to the number of multi-indices such that mi x • • • x nid is less 
than By counting first the index mi, this can be upper-bounded by the sum of 

over all indices m 2 ,..., rud less than which is less than ^ ^ ~ 

(s log This results in a decay of eigenvalues bounded by (log 

(this can be obtained by inverting approximately the function of A). 
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When the decay is exponential as exp(—pA), then we get that m* (A) is the number of multi-indices 
(mi,..., rrid) such that their sum is less than c = —when c is large, this is equivalent to 

times the volume of the d-dimensional simplex, and thus less than ^ This leads to a 

decay of eigenvalues as exp(—or, by using Stirling formula, less than exp(— 


Appendix B. Proofs 


B.l Proof of Prop. 1 


As shown in Section 2.2, any / G 3“ with 3“-norm less than one may be represented as / = 
g(v)(p(v, ■)dT{v), for a certain g G L2{dT) with L 2 (dr)-norm less than one. We do not solve the 
problem in (3 exactly, but use a properly chosen Lagrange multiplier A and consider the following 
minimization problem: 


n « 

- / ‘P{v,-)9{v)dT{v] 

i=l 


+ n 


L2{dp) 


We consider the operator $ : M” —)• L2 {dp) such that 

n 

^13 = /3iQ{vi)~^^^(p{vi, •)• 

i=l 

We then need to minimize the familiar least-squares problem: 

with solution from the usual normal equations and the matrix inversion lemma for operators (Ogawa, 
1988): 

/3 = (15) 

n n 

We consider the empirical integral operator S : L 2 {dp) L 2 {dp), defined as 

1 1 ” 1 


tU ^ ■ f u ^ T \ / ¥^(^*5 ■))L2(dp)(^5 ■))L2(rfp) ^ 

that IS, tor a, 0 G L 2 {dp), (a, Eo) 2 , 2 ((ip) = > - 7—7 -■ By construe- 


2=1 


(l{vi) 


tion, and following the end of Section 2.2, we have ES = S. 


The value of ||/- is equal 


to 


Wf-^PWUdp) 


11 / - S(S + A/)-VllL(.p) = l|A(S + A/)-VllL(rfp) 

A2(/, (S + A/)-V)^,(,,) ^ A(/, (S + A/)-V)^,(,,), (16) 
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because (S + XI) ^ ^ A + XI) ^ (with the classical partial order between self-adjoint oper¬ 
ators). 

Finally, we have, with /? = -|- XI)~^f: 

nml = ((S + A/)-V, S(S + A/)-V)i,(,,) ^ </, (S + A/)-V)^,(,,), (H) 

using (S + XI)-^t ^ (S + XI)-\ 

By construction, we have E(S) = S. Moreover, we have, by Cauchy-Schwarz inequality: 


(o, (/ ®L 2 {dp) f)(^)L 2 (dp) 


€ 


/ a{x)f{x)dp{x)] ={ a{x)g{v)(p{v,x)dT{v)dp{x) 

lx J \JxJv 

9i'vfdT{v)'^jj^ j^{x)ip{v, x)dp(x)^ dT(v) 


Thus / ®L 2 {dp) / ^ 5], and we may thus define (/, S ^/)L 2 (dp)> which is less than one. 

Overall we aim to study (/, (S -|- A/)“^/)i 2 (dp)’ (/> f)L 2 idp) ^ 1> to control both the norm 

II/ 3 II 2 in Eq. (17) and the approximation error ||/ — in Eq. (16). We have, following a 

similar argument than the one of Bach (2013); El Alaoui and Mahoney (2014) for column sampling, 
i.e., by a formulation using S — S in terms of operators in an appropriate way: 


(/,(S + A/)-V)L 2 (dp) 

= (/,(S + A7 + S-S)-V)L2(dp) 

= ((S + A/)-V2/, [/ + (S + A/)-V2(s - S)(S + A/)-V2] -i(s + A/)-V2/)^^^^^^. 


Thus, if (S -|- XI) — S)(S -|- XI) )>= —tl, with t G (0,1), we have 

(/,(E + A/)-V)L2(dp) ^ ((S + A/)-V2/,(i_t)-i(s + A/)-V2/)^^(,^) 

= (l-t)-i(/,(S + A/)-V)L2(dp) 

^ (1 -t)”^(/,S-V)L 2 (rfp) ^ (1 


Moreover, we have shown (S -|- XI) ^ =4 + A/) 

Thus, the performance depends on having (S -|- A/)“^/^(S — S)(S -|- A/)“^/^ ^ tl. 

We consider the self-adjoint operators Xi, for f = 1,..., n, which are independent and identically 
distributed: 

X, = i(S + A/)-iS - [(S + A/)-V 2 ^(^.,.)] [(s + A/)-V 2 <^(„.,.)], 

71 71 Q\Vi ) 

so that our goal is to provide an upperbound on the probability that || ^*11 op > where || • ||op 

is the operator norm (largest singular values). We use the notation 


(i = trS(S + A/)"^ = 


/ 


{ip{v,-),{T. + XI) V(r;,-))L 2 (dp) 
q{v) 


q{v)dT{v) ^ dn 
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We have 


KXi = 0, by construction of Xi, 

^-(S + A/)"^S ^ - tr [(S + A/)"^Sl/ ^ 
f 71 ^ 


X- 


n 


n 


^- 


1 1 
n 
1 1 
n 


1 1 op 


^max 

-as a 

n 

e(x7 


E 

1 1 


nq{vi) 



E 

'1 1 


nq(vi) 


^[(S + A/)-'/V(^*,-)] ®L3(dp) [(S + A/)”i/V(^*,-)] 
as a consequence of the two previous inequalities, 

"I 2 

-^[(S + A/)-VVK,-)] ®L3(dp) [(S + A/)-i/V(u,,-)] - [-(S + A/)-iS]‘ 

77 . ^^ 772 ) J ?7 








E- 

dr 


•), (S + A/) L2(dp) \ , 0 , I \ r^-l/2 ( ^l 

--[(S + A/) ! ^{Vi, OJ ®L2{dp) [(S + A/) ! ^{Vi, OJ 


^E([^(S + A/)-i/V(r;i,-)] ®L2{dp) [(E + A/)-VV(u„-)]) = ^S(S + A/)-i, 


n 

dn 


i=l 


n 


KS + A/)"^S 


with a maximal eigenvalue less than and a trace less than tr S(S + A/) ^ ^ _ 

n n n 

Following Hsu et al. (2014), we use a matrix Bernstein inequality which is independent of the un¬ 
derlying dimension (which is here infinite). We consider the bound of Minsker (2011, Theorem 
2.1), which improves on the earlier result of Hsu et al. (2012, Theorem 4), that is: 


P 


Ex 

2=1 


op 


> f ^ 2d 1 + 


log^(l + nf/dmax) 


exp 


We now consider f = |, 5 € (0) 1)> n ^ Hdmaxlog—-—, with appropriate constants 

^ 0 

B,C > 0. This implies that 

t^l 2 


t ^/2 


dmax/?T'(l + f/3) 


exp - 


, (3/4)72 Cd^ 


dmax/n(l + f/3) 

and, if dmax ^ D, using n ^ Hdmax log CD, 

6 


§ (3/4)^B/2 ^ (3/4)2b/2 

^ (-) E4 ^ (-) 574 , 

^Cdr...J ^Cd’ 


1 + 


^ 1 + 


6-16/9 


D log^(l + nt/dyns,x) ^ ^ log^ (1 + (3H/4) log(CiA)) ’ 

while if dmax ^ D and n ^ 1, 

6 .. 6-16/9 


1 + 


^ 1 + 


D log^(l + nt/dmax) ^ log^ (1 + (3/4T>)) 
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In order to get a bound, we need ^ 1, and we can take B = 5. If we take (7 = 8 , then in 

order to have 1 + 75 -;— ^ ,,, —r ^ 4, we can take D = 3/8. Thus the probability is less than 6 . 

t 2 log"^(l+nt/dmax) ' t' J 

Finally, in order to get the extra bound on ^ ■)lli 2 (rfp)’ we consider Etr S = 

tr S = k(x, x)dp(x), and thus, by Markov’s inequality, with probability 1 — S, 

1 ^ X 1 

=trS ^ -trS. (18) 

2=1 

By taking <5/2 instead of 6 in the control of || -^illop > t and in the Markov inequality above, 
we have a control over ||/ 3 || 2 , tr S and the approximation error, which leads to the desired result in 
Prop 1. This will be useful for the lower bound of Prop. 3. 

We can make the following extra observations regarding the proof: 

- It may be possible to derive a similar result with a thresholding of eigenvalues in the spirit 
of Zwald et al. (2004), but this would require Bernstein-type concentration inequalities for 
the projections on principal subspaces. 

- We have seen that with high-probability, we have (S -|- A/)“^ ^ 4(S -|- \I)^^. Note that 
A ^ B does not imply in ^ (Bhatia, 2009, page 9) and that in general we do not have 
(S -I- A/)“^ ^ (7(S -|- A/)“^ for any constant C (which would allow an improvement in the 
error by replacing A by A^, and violate the lower bound of Prop. 3). 

- We may also obtain a result in expectation, by using 5 = 4A/tr S (which is assumed to 

be less than 1 ), leading to a squared error with expectation less than 8 A as soon as n ^ 
5dmax(A) log ^ g^j^ ygg {j^g bound 4A with probability 1 — (i and 

||/|||^(^^) ^ tr S with probability <5, leading to a bound of 4A(1 — <)) -|- <5 tr S ^ 8 A. We use 
this result in Section 4.5. 


B.2 Proof of Prop. 2 


We start from the bound above, with the constraint n ^ 5d(A) log , Statement (a) is a simple 


reformulation of Prop. 1. For statement (b), if we assume m ^ 


. I67 


-, and A = pm, then we 


5(1+7) log 

have d{X) ^ (1 -|- 7 )m, which implies n ^ bd{X) log and (b) is a consequence of (a). 


B.3 Proof of Prop. 3 

We first use the Varshamov-Gilbert’s lemma (see, e.g., Massart, 2003, Lemma 4.7). That is, for any 
integer s, there exists a family { 6 j)j^j of at least | J| ^ distinct elements of {0,1}*, such that 
for j // G J, WOj -OfWl ^ f. 

Foreach0 G {0,1}*, we define an element of T with norm less than one, as/(0) = G 

T, where {ei,pi), i = 1 ,..., s are the eigenvector/eigenvalue pairs associated with the s largest 
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eigenvalues of S. We have, since ^ for i G {1,..., s} and \\0\\\ ^ 1: 


I = V E ".Vr'< ^ E i E 1 

2=1 2=1 


S ^ 

2=1 


Moreover, for any j / / G J, we have ||/(0j) - /(0j')lli2(dp) = ^ll^j “ Gy\\l ^ 

We now assume that s is selected so that \/4A ^ </^/3. By applying the existence results to all 


functions fj, j G J, then there exists a family {I3j)j^j of elements of M”, with squared ^ 2 -norm 


less than -, and for which, for all j, 

71. ^ ’ 


fj - 


2=1 


L 2 (dp) 


^ VIx. 


This leads to, for any j / j' G J, 


'^{f^j ^ II/?' fj'\\L2{dp) II fj 


i=l 


i=l 


L2{dp) 


- fj 


2 = 1 


L 2 (dp) 




Moreover, we have the bound 
2 


^(/3j - 


i=l 




L2{dp) 


Y.^^j - ^ Wj-ml •n(25-itrS). 

i=l ^ i=l 


Combining the last two inequalities, we get Wfjj—fiji II 2 ^ Thus, e®/® is less than the 

A-packing number of the ball of radius r = 2/-y/n, which is itself less than (r/A)"^(2 +A/r)"^ (see, 

e.g., Massart, 2003, Lemma 4.14). Since A/r = \J ^ 12 ^/ 2 ’ 


- ^ n( - log 


1 , 4 • 72 tr S 


5jis 




This implies n ^ 


4 log 1^+29 


. Given that we have to choose ^ 144A for the result to hold, this 


implies the desired result, since 41og(1440) ^ 29. 
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